-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Logstash detectors module #327
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @chungktran
Thanks for your contribution.
Some remarks as comments in the PR can be applied to all detectors:
- I would probably change all transformation to
true
(or replacemean
bymin
and delete all lasting functions) - I would delete all description which do not bring more information than the detector name (you wan put
too high
/too low
in the detector name if you want).
most of the detectors in this module do not rely on a "plug and play" approach as we try to do in general. instead, they need a specific threshold depending on each environment the user needs to set, for that you can:
- disable them by default to let the user ability to enable it if they want (and know how to configure them)
- let the threshold undefined to force the user to do it
- trying to use more relative and generic approach like the queue size approach I proposed in the PR
- mix all of these
this is not a problem but notice the default rollup for cumulative metrics are delta
so it should not be necessary to define it explicitly.
rollup: delta | ||
rules: | ||
major: | ||
description: is high |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would let the description undefined while it does not bring more valuable information than the default, this can apply to all other similar descriptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to leave the description b/c the default says too high
for both critical
and major
which to me is not right. When it's major
I want it say high
and when it's critical
I want it to say too high
which make perfect sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but no matter their severity both rules are too high compared to their respective threshold and that is true to everybody, not a personal meaning preference.
I would prefer to keep modules homogeneous to give a consistent behavior to the users.
That said we could open a request feature to create variables to give the user the ability to customize the rule description as he want.
rules: | ||
major: | ||
description: is high | ||
threshold: 25000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as this detector thresholds highly depends on each environment, I would let threshold undefined to force the user to customize this according to his needs.
this can applied to all other detectors except may be the critical too low ones (equals to 0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to the module to be as useful out of the box as much as possible without too much configuration. With your suggestion the end user will get error when doing terraform plan/apply
and that's very un-friendly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree about the goal to make all modules useful out the box as much as possible and this is why I suggest this in addition to my other suggestion to prefer relative based detectors (like percentage).
There are 2 possibilities:
- either the thresholds you choose have an universal or common logic so you can keep it but you just need to explain them (as Notes in readme, or as tip on detector, or mix them). e.g. on aws rds, the max connection is calculated depending on the instance type so we could specify a default value in this case which should match and work for any user who did not set a custom limit on their RDS.
- or these thresholds have been choose from your environment, with your requirements, your needs and your criteria so they have no sens or legitimacy for others and so they should not have been set as default in a generic module
the modules should be as plug and play as possible, meaning a user can deploy detectors without to customize anything and it works, users are used to this.
if you let default values for thresholds which must always be customized this is misleading for the user and a trap because they will not notice the problem until the moment they do not get alert when they want to.
if you do not set default values, so these variables will be automatically added to the module example usage in the readme to inform the user these variables are mandatory and yes if they forget them they will get an explicit and understandable error to force them to set, so no bad surprise in future because they deployed detectors which will never work in their environment (in my opinion this is more un-friendly)
disk: | ||
metric: node.stats.pipelines.queue.queue_size_in_bytes | ||
rollup: delta | ||
signal: | ||
formula: (disk / 1000000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of applying a static threshold check to the queue which, again, fully depends on each environment may by can you try to make this check more generic by making a comparison between node.stats.pipelines.queue.max_queue_size_in_bytes
and node.stats.pipelines.queue.queue_size_in_bytes
.
disk: | |
metric: node.stats.pipelines.queue.queue_size_in_bytes | |
rollup: delta | |
signal: | |
formula: (disk / 1000000) | |
size: | |
metric: node.stats.pipelines.queue.queue_size_in_bytes | |
max: | |
metric: node.stats.pipelines.queue.max_queue_size_in_bytes | |
signal: | |
formula: (size/max).scale(100).fill(value=0).publish('signal') |
from this way you can define relative percentage based threshold which could fit to everybody.
also, it seems redundant to the previous one event count detector. you can keep it but you should probably use disabled: true
to disable it by default. this allow interested users to keep a hard queue limit count for who know it but don't enable alert by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Monitoring disk size in Logstash queue is different than the queued events in the pipeline. The disk size is cumulative of all pipelines so it is not quite the same. I'd like to keep this as static threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok it was only a suggestion to make this detector more generic. while most of your detectors use absolute thresholds they are not usable and useful out of the box for other users as seen above.
@xp-1000 Thanks for all your comments and suggestions. I will make the changes accordingly. |
@xp-1000 I've made changes to some of the suggested changes. Please review. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks to take the time to make the changes and to answer to my remarks.
I answered too, homogeneity and generic purpose of the modules in this repository make them constraining for sure but I believe we must find a cursor between the simplicity of usage and the relevance of the detectors for everybody.
@xp-1000 This is the replacement PR for PR 302 that I did in the past. This is using the generator instead. Please review.